Abstract

Background: Obtaining transcripts of homologs of closely related organisms and retrieving the reconstructed
exon-intron patterns of the genes is a very important process during the analysis of the evolution of a protein
family and the comparative analysis of the exon-intron structure of a certain gene from different species. Due to 
the ever-increasing speed of genome sequencing, the gap to genome annotation is growing. Thus, tools for the 
correct prediction and reconstruction of genes in related organisms become more and more important. The tool 
Scipio, which can also be used via the graphical interface WebScipio, performs significant hit processing of the 
output of the Blat program to account for sequencing errors, missing sequence, and fragmented genome 
assemblies. However, Scipio has so far been limited to high sequence similarity and unable to reconstruct short 
exons. 

Results: Scipio and WebScipio have fundamentally been extended to better reconstruct very short exons and 
intron splice sites and to be better suited for cross-species gene structure predictions. The Needleman-Wunsch 
algorithm has been implemented for the search for short parts of the query sequence that were not recognized 
by Blat. Those regions might either be short exons, divergent sequence at intron splice sites, or very divergent 
exons. We have shown the benefit and use of new parameters with several protein examples from completely 
different protein families in searches against species from several kingdoms of the eukaryotes. The performance of 
the new Scipio version has been tested in comparison with several similar tools. 

Conclusions: With the new version of Scipio very short exons, terminal and internal, of even just one amino acid 
can correctly be reconstructed. Scipio is also able to correctly predict almost all genes in cross-species searches 
even if the ancestors of the species separated more than 100 Myr ago and if the protein sequence identity is 
below 80%. For our test cases Scipio outperforms all other software tested. WebScipio has been restructured and 
provides easy access to the genome assemblies of about 640 eukaryotic species. Scipio and WebScipio are freely 
accessible at http://www.webscipio.org. 



Background 

Whole genome sequences of eukaryotes are generated 
with increasing speed [1]. While the focus at the begin- 
ning of high-throughput DNA sequencing was on model 
organisms and the human genome, for which tremendous
amounts of secondary data were available, the aims
have shifted to organisms of medical or economic 
relevance (e.g. Plasmodium falciparum [2] or Phytophthora
ramorum [3]), to the comparative analysis of
entire taxa (e.g. the Drosophila clade [4] or Candida spe- 
cies [5]), and, very recently, to organisms of evolutionary 
interest (e.g. Trichoplax adhaerens [6] or Volvox carteri 
[7]). However, gene catalogues are only available for a 
small part of the sequenced organisms and a precise 
and complete set of genes is still unavailable for even a 
single species. In the first instance the gene annotation 
is done with automatic gene prediction programs that 
either predict only isolated exons, or reconstruct the 
complete exon-intron structures of the protein-coding 
genes, or even try to predict 5' and 3' untranslated
regions. Ab-initio gene prediction programs only use the 
assembled DNA sequences as input, having precom- 
puted models for nucleotide distributions, while evi- 
dence-based programs consider alignments of ESTs, 
cDNAs, or annotated sequences from closely related 
organisms, with the target sequence (reviewed in [8]). 
The highest accuracy is reached by programs that com- 
bine model-based and alignment-based approaches 
[9,10]. 

For many biological applications like the phylogenetic 
analysis of a protein family (e.g. [11]) or the comparative 
analysis of the exon-intron structure of a certain gene 
from different species (e.g. [12]), it is necessary to obtain 
translated transcripts of homologs of closely related 
organisms or the reconstructed exon-intron patterns of 
the genes, respectively. The protein sequences of homo- 
logs of a certain protein can be obtained in several 
ways. Annotations based on ab-initio gene predictions, 
sometimes supplemented by EST data, are available for 
about half of the sequenced eukaryotic genomes, 
although it is often tedious to find the corresponding 
data via the FTP-pages of the sequencing centers. In 
addition, automatic predictions are not complete and in 
many cases not correct. For very few eukaryotes, full- 
length cDNA data can be accessed. However, these data 
never cover the complete transcriptome of the species. 
Another possibility is to manually annotate the protein 
homologs in the genomes of choice by comparative 
genomics. This is certainly the most accurate way. In
this approach a multiple sequence alignment of as many
homologs as possible is created, and based on this
alignment mispredicted sequence regions
(insertions and missing regions) are easily detected.
Further homologs are added by manual inspection of
the corresponding genomic DNA regions and manual
reconstruction of intron splice sites. Splice sites are in
most cases conserved throughout the eukaryotes [13],
and therefore their position and frame can be used for
gene reconstructions by comparing the structures of known
genes with those of the genes to be annotated.

To assist in the task of the manual annotation of 
eukaryotic genomes, and to provide options for genomes 
for which gene prediction data is not available, we have 
recently developed Scipio [14,15]. Scipio is a post-pro- 
cessing script for the Blat output [16] and maps a pro- 
tein sequence to a genomic DNA sequence. Blat has 
been developed for the fast alignment of very similar 
DNA or protein sequences. However, Blat is not able to 
identify very short exons (two or three amino acids, or 
exons of just the N-terminal methionine), it is not able 
to assemble genes spread on more than one contiguous 
DNA sequence, it misses exons that are too divergent, it 
does not apply biological sequence models to determine
exact splice site locations at the nucleotide level, or to
distinguish introns from insertions caused by frameshifts
or in-frame stop codons [14,15,17]. Scipio is able to 
address most of these issues resulting in considerably 
improved gene structure reconstructions [14,15]. Its 
initial intention was to cope with sequencing errors, to 
assemble genes from highly fragmented genome assem- 
blies, and to reconstruct intron splice sites. Scipio was 
not able to correctly reconstruct very short exons or to 
correctly reconstruct genes in cross-species searches if 
these were not highly identical. 

Here, we present the fundamentally improved version 
of the Scipio software that has been extended for the 
use in cross-species searches. In addition, very short 
exons and divergent regions at intron borders are now 
correctly reconstructed. Scipio can be used via the web- 
interface WebScipio that provides access to 2111 gen- 
ome assembly files for 592 species (end of February 
2011). 

Methods 

The presented software consists of two programs that 
form a pipeline for the output of the external program 
Blat, which is executed first. The Blat results are post- 
processed by the Scipio script written in Perl [18]. 
WebScipio provides a graphical user interface for Scipio 
that we have developed using the web framework Ruby 
on Rails [19,20]. The workflow was optimized to direct 
the user to the necessary input parameters. This was
implemented with the technique of Asynchronous JavaScript
and XML (AJAX). Visual effects were realized
with the JavaScript libraries Prototype [21] and script.aculo.us [22],
which are integral parts of Ruby on Rails.

Scipio 

The Scipio Perl script itself, which can also be run stan- 
dalone, has undergone numerous extensions that are 
based on our extensive experience in manual gene anno- 
tation [11,23,24]. The general setup of the script that 
aimed to handle all the various sequencing and assembly 
errors has already been described [14]; here, we present 
an implementation of the Needleman-Wunsch algorithm 
which is the main extension to the previous version. 

The Needleman-Wunsch Algorithm used in Scipio

In the updated version of Scipio, we use a modified 
Needleman-Wunsch style dynamic programming (DP) 
algorithm to perform an exhaustive search for the best- 
scoring spliced alignment between the query and target 
sequence fragments that were left unmatched by Blat. 
Like the original Needleman-Wunsch algorithm, it cal- 
culates an optimal global alignment between the 
sequences, but it is adjusted to find an optimal spliced 
alignment between a protein query sequence $s$ and a
genomic target sequence $t$. Given the computational
cost, which depends on $|s|$ and $|t|$, it is executed only on very short
sequence fragments $s$ and $t$. We introduce different
categories of penalties depending on the type of matching.
Any alignment can be represented by a parse $\Phi$: a
collection of pairs of strings $(s_1, t_1), \ldots, (s_r, t_r)$, such that
the aligned sequences are the concatenations $s = s_1 \ldots s_r$ and
$t = t_1 \ldots t_r$. A penalty score $p(s_k, t_k)$ is assigned to each
pair as follows:

♦ if $s_k$ is a single residue and $t_k$ a string of length 3
(codon), then $p(s_k, t_k) = p_{MAP}(s_k, t_k)$ is a match/mismatch
penalty:

$$p_{MAP}(s_k, t_k) = \begin{cases} 0, & \text{if } t_k \text{ translates to } s_k \\ p_{MISM}, & \text{if not} \end{cases}$$

♦ an insertion penalty $p(s_k, t_k) = p_{INS}$ is assigned to
them if $s_k$ is a single residue and $t_k$ is empty

♦ a gap penalty $p(s_k, t_k) = p_{GAP}$ is assigned to them if
$t_k$ is a codon and $s_k$ is empty

♦ a frameshift penalty $p(s_k, t_k) = p_{FS}$ is assigned to
them if $t_k$ consists of 1 or 2 nucleotides, and $s_k$ is
empty or a single residue

To cover the case of introns, in addition we define
intron penalties based on the donor and acceptor splice
sites:

$$p_{INTRON}(n_1 \ldots n_\ell) = p_{DSS}(n_1 n_2) + p_{INT} + p_{ASS}(n_{\ell-1} n_\ell)$$

with a constant value $p_{INT}$ for any sequence of nucleotides
and zero splice site penalties if $n_1 n_2$ = "GT"
and $n_{\ell-1} n_\ell$ = "AG". We distinguish two cases: in-frame
introns, and introns splitting codons:

♦ if $s_k$ is empty and $t_k$ exceeds the minimum intron
length, then $p(s_k, t_k) = p_{INTRON}(t_k)$ is the intron
penalty

♦ if $s_k$ is a single residue, and $t_k = n_1 \omega n_2 n_3$ or
$t_k = n_1 n_2 \omega n_3$ with single nucleotides $n_1, n_2, n_3$ and $\omega$ a string
exceeding the minimum intron length, then the penalty
is a combined match/intron penalty:
$p(s_k, t_k) = p_{INTRON}(\omega) + p_{MAP}(s_k, n_1 n_2 n_3)$. Here, two different
penalties are defined (depending on the frame of the
intron), and thus the minimum of them is taken.

If $(s_k, t_k)$ does not satisfy any of these conditions, no
penalty is defined, resulting in an invalid parse. By combining
insertions, deletions, and frameshifts, there is
always some valid parse for any given pair of sequences.
The cost of a parse $\Phi$ is the sum of the penalties,
$p(\Phi) = p(s_1, t_1) + \ldots + p(s_r, t_r)$, and we calculate

$$p(s, t) = \min \{\, p(\Phi) \mid \Phi \text{ is a valid parse aligning } s \text{ and } t \,\}$$

by computing the DP matrix $(M_{ij})$ containing the
minimal score for an alignment of the subsequences $s[0..j-1]$
and $t[0..i-1]$, using the following recursion:

$$M_{ij} = \min \left\{ \begin{array}{l}
M_{(i-3)(j-1)} + p_{MAP}(s[j-1],\ t[i-3..i-1]), \\
M_{i(j-1)} + p_{INS}, \\
M_{(i-3)j} + p_{GAP}, \\
\min \{ M_{(i-1)j},\ M_{(i-2)j},\ M_{(i-1)(j-1)},\ M_{(i-2)(j-1)} \} + p_{FS}, \\
\min_{i' \le i - \ell_{min}} \{ M_{i'j} + p_{INTRON}(t[i'..i-1]) \}, \\
\min_{i' \le i - \ell_{min} - 3} \{ M_{i'(j-1)} + p_{MAP}(s[j-1],\ t[i']\,t[i-2,i-1]) + p_{INTRON}(t[i'+1..i-3]) \}, \\
\min_{i' \le i - \ell_{min} - 3} \{ M_{i'(j-1)} + p_{MAP}(s[j-1],\ t[i',i'+1]\,t[i-1]) + p_{INTRON}(t[i'+2..i-2]) \}
\end{array} \right.$$

where each of these expressions corresponds to one of
the possible penalty types for the last segment of the
parse.

The last three lines cover introns, one for each reading
frame, with $\ell_{min}$ denoting the minimum intron
length. To avoid having to iterate over all values for $i'$ in
these cases, we precompute nine variants of the score
matrix with partial intron penalties added (indexed by a
nucleotide $n$ if it splits a codon) as follows:

$$M^{(0)}_{ij} = \min_{i' \le i - \ell_{min}} \{ M_{i'j} + p_{DSS}(t[i',i'+1]) \} + p_{INT}$$

$$M^{(1)}_{ij,n} = \min_{i' \le i - \ell_{min} - 3,\ t[i'] = n} \{ M_{i'(j-1)} + p_{DSS}(t[i'+1,i'+2]) \} + p_{INT}$$

$$M^{(2)}_{ij,n} = \min_{i' \le i - \ell_{min} - 3} \{ M_{i'(j-1)} + p_{MAP}(s[j-1],\ t[i',i'+1]\,n) + p_{DSS}(t[i'+2,i'+3]) \} + p_{INT}$$

Note that $n$ denotes the nucleotide before the intron
in $M^{(1)}$ and the nucleotide after it in $M^{(2)}$. The latter
already contains the mismatch penalty, while the former
does not. With $i'$ the latest segment start allowed ($i' = i - \ell_{min}$
for an intron scored by $M^{(0)}$, and $i' = i - \ell_{min} - 3$
for a codon split by an intron), the intron variables are
given recursively by

$$M^{(0)}_{ij} = \min \{ M_{i'j} + p_{DSS}(t[i',i'+1]) + p_{INT},\ M^{(0)}_{(i-1)j} \}$$

$$M^{(1)}_{ij,n} = \min \{ M_{i'(j-1)} + p_{DSS}(t[i'+1,i'+2]) + p_{INT},\ M^{(1)}_{(i-1)j,n} \} \quad (n = t[i'])$$

$$M^{(2)}_{ij,n} = \min \{ M_{i'(j-1)} + p_{MAP}(s[j-1],\ t[i',i'+1]\,n) + p_{DSS}(t[i'+2,i'+3]) + p_{INT},\ M^{(2)}_{(i-1)j,n} \}$$

(for $n \ne t[i']$, $M^{(1)}_{ij,n} = M^{(1)}_{(i-1)j,n}$), and then replace the last three lines in the recursion
for $M_{ij}$ by

$$\begin{array}{l}
M^{(0)}_{ij} + p_{ASS}(t[i-2,i-1]), \\
\min_{n} \{ M^{(1)}_{ij,n} + p_{MAP}(s[j-1],\ n\,t[i-2,i-1]) \} + p_{ASS}(t[i-4,i-3]), \\
M^{(2)}_{ij,t[i-1]} + p_{ASS}(t[i-3,i-2])
\end{array}$$
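The running-minimum idea behind the in-frame intron matrix can be illustrated for a single query position j. The following Python sketch is not taken from Scipio (which is written in Perl): the function names, the donor-site penalty of 1.0 for non-GT sites, and the use of 22 nucleotides as minimum intron length here are assumptions made for illustration. It compares the direct minimisation over all intron starts i' with the recursive update that only examines the newest admissible start.

```python
INF = float("inf")
P_INT = 2.0    # constant intron penalty (default value quoted in the text)
L_MIN = 22     # minimum intron length in nucleotides (assumed here)

def p_dss(dinucleotide):
    """Donor splice site penalty; 1.0 for non-GT sites is a hypothetical value."""
    return 0.0 if dinucleotide == "GT" else 1.0

def m0_direct(m_col, t, i):
    """Partial intron score at i by explicit minimisation over all i' <= i - L_MIN."""
    starts = range(0, i - L_MIN + 1)
    best = min((m_col[ip] + p_dss(t[ip:ip + 2]) for ip in starts), default=INF)
    return best + P_INT

def m0_recursive(m_col, t):
    """All partial intron scores in one pass over the target positions."""
    m0 = []
    for i in range(len(t) + 1):
        ip = i - L_MIN  # latest intron start allowed for an intron ending before i
        newest = m_col[ip] + p_dss(t[ip:ip + 2]) + P_INT if ip >= 0 else INF
        previous = m0[i - 1] if i > 0 else INF
        m0.append(min(newest, previous))
    return m0
```

Here `m_col` stands for one column of the DP matrix (fixed j, indexed by the target position i). The recursive variant needs constant work per cell, whereas the direct variant needs work proportional to i.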



The penalties for the Needleman-Wunsch algorithm 
can be adjusted manually in the Scipio command-line 
version but not via the WebScipio web-interface. The 
penalties need to be well balanced so that the Needle- 
man-Wunsch search does not result for example in a 
number of artificial short exons where a long exon is 
missing due to a gap in the genome assembly. Based on
extensive tests with in-house test data we set the follow- 
ing values as default: mismatch-penalty: 1.0; insertion- 
penalty: 1.5; gap-penalty: 1.1; frameshift-penalty: 2.5; 
intron-penalty: 2.0 + the respective penalties for donor 
and acceptor splice sites. 
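A minimal Python sketch of the spliced-alignment recursion, using the default penalties listed above, may help to make the scoring concrete. It is not the Scipio implementation: it assumes the standard genetic code, a hypothetical penalty of 1.0 for any non-consensus splice site, a minimum intron length of 22 nucleotides, and it minimises over all intron starts directly instead of using the precomputed intron matrices. All function names are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

P_MISM, P_INS, P_GAP, P_FS, P_INT = 1.0, 1.5, 1.1, 2.5, 2.0  # defaults from the text
P_SPLICE = 1.0  # hypothetical penalty for a non-GT donor or non-AG acceptor
L_MIN = 22      # minimum intron length (assumed)
INF = float("inf")

def p_map(residue, codon):
    """Match/mismatch penalty for one residue against one codon."""
    return 0.0 if CODON_TABLE.get(codon) == residue else P_MISM

def p_intron(seq):
    """Intron penalty: donor site + constant + acceptor site."""
    dss = 0.0 if seq[:2] == "GT" else P_SPLICE
    ass = 0.0 if seq[-2:] == "AG" else P_SPLICE
    return dss + P_INT + ass

def spliced_alignment_score(s, t):
    """Minimal penalty of a global spliced alignment of protein s to DNA t."""
    n, m = len(t), len(s)
    M = [[INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            cand = []
            if i >= 3 and j >= 1:                      # residue vs. codon
                cand.append(M[i - 3][j - 1] + p_map(s[j - 1], t[i - 3:i]))
            if j >= 1:                                 # residue without codon
                cand.append(M[i][j - 1] + P_INS)
            if i >= 3:                                 # codon without residue
                cand.append(M[i - 3][j] + P_GAP)
            for di in (1, 2):                          # frameshifts
                for dj in (0, 1):
                    if i >= di and j >= dj:
                        cand.append(M[i - di][j - dj] + P_FS)
            for ip in range(0, i - L_MIN + 1):         # in-frame intron t[ip..i-1]
                cand.append(M[ip][j] + p_intron(t[ip:i]))
            if j >= 1:
                for ip in range(0, i - L_MIN - 3 + 1):
                    # codon split after its first nucleotide
                    cand.append(M[ip][j - 1]
                                + p_map(s[j - 1], t[ip] + t[i - 2:i])
                                + p_intron(t[ip + 1:i - 2]))
                    # codon split after its second nucleotide
                    cand.append(M[ip][j - 1]
                                + p_map(s[j - 1], t[ip:ip + 2] + t[i - 1])
                                + p_intron(t[ip + 2:i - 1]))
            M[i][j] = min(cand)
    return M[n][m]
```

For example, the query "MK" aligned to "ATG", followed by a 24-nucleotide GT...AG intron and "AAA", scores exactly the constant intron penalty, because both codons match and both splice sites are canonical.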

WebScipio 

At present, the web interface offers 2272 genome files of 
643 eukaryotic organisms. Metadata corresponding to 
the species, like assembly versions, sequencing centers, 
and assembly coverage, is available from the diArk data- 
base [25]. WebScipio reads the metadata out of a peri- 
odically updated text file generated from diArk, or 
queries the diArk database directly with SQL. 

The gene structure schemes resulting from the Scipio 
run are generated and displayed in the Scalable Vector 
Graphics (SVG) format [26]. This allows scaling the graphics
while retaining their resolution and showing tooltips
generated with JavaScript and HTML for each
element of the gene structure schemes. For browsers 
not supporting SVG, a fallback solution is implemented, 
which uses the Portable Network Graphics (PNG) for- 
mat. The PNG files are generated by Inkscape [27]. 

Internally, the sequence data is processed with the 
help of BioRuby [28]. Results are saved in the YAML 
format [29], but are also available for download in the 
GFF format. The web application runs the Blat and Sci- 
pio jobs in the background, which was implemented 
using the Rails plug-in Workling in combination with 
Spawn [30,31]. The server-side stored session data is 
increasing with every extension of WebScipio. To make 
the session storage fast, flexible, and scalable we use a 
database backend called Tokyo Cabinet [32]. It offers a 
simple key-value store, also called hash store, for acces- 
sing different data objects with the help of a unique key 
for each object. Tokyo Tyrant is the network interface 
to Tokyo Cabinet and allows storing data across the net- 
work on several servers. It is used in WebScipio for scal- 
ability reasons. 

External Tools 

We use Hoptoad for error reporting [33]. It is a web 
application that collects errors generated by WebScipio, 
aggregates them into detailed error reports for developer
review, and sends email notifications. We use a
behaviour-driven testing strategy to validate the func- 
tionality and behaviour of WebScipio. For the automa- 
tion of these tests we use RSpec [34], which is a 
behaviour-driven development framework for the Ruby 
programming language. We implemented these tests to
ensure the reliability and accuracy of
the continuously extended software. Application
tests are run with Selenium, a test system for web
applications [35]. This offers the opportunity to test the
web-interface as a whole. Selenium integrates into the
Mozilla Firefox browser as a plug-in that records the
user interaction in the form of a Ruby script. To run
the test scripts without user interaction, Selenium starts
and controls the browser automatically. We integrated
the user-interface tests into our automated test environment
as additional RSpec test cases.

Results and Discussion 

Scipio and WebScipio workflow, and general parameters 
for fine-tuning gene predictions 

The general workflow of Scipio and WebScipio is simi- 
lar to that described previously [14,15]. Scipio provides 
some general search parameters that filter the Blat out- 
put for further post-processing, and offers several expert 
options that influence the post-processing steps. In the
new Scipio version, especially the gap-closing
(mapping the parts of the query sequence to the target
sequence that Blat failed to recognize) and the hit extension
(modelling the regions at exon borders, including terminal
exons, where homology was too low to be identified
by Blat) have been improved (Figure 1, see also Additional
file 1). This has been done by implementing the
Needleman-Wunsch algorithm for the search of 
unmapped query sequence in respective target regions 
and by introducing parameters that allow a higher diver- 
gence from the exon border regions predicted by Blat. 
All new parameters are adjustable by the user although 
the default values should be good enough for most 
cases. However, especially when searching for very 
divergent homologs or when searching for homologs of 
very divergent species, these parameters might need 
manual adaptation. Figure 1 shows a detailed scheme of 
the Scipio workflow including all parameters that can 
manually be adjusted. Also, some of the most important 
decisions are outlined that Scipio makes to provide the 
best possible result. The detailed scheme should allow 
the experienced user to fine-tune the search in espe- 
cially difficult cases. The rationale for implementing 
each of the parameters and its consequences are 
explained below. 

The new web-interface 

Because we wanted to offer most of the new parameters 
to the experienced user via the web-interface WebSci- 
pio, and we planned to introduce searches for alterna- 
tively spliced exons, we had to redesign the WebScipio 
workflow. The goal was to keep it well structured, intui- 
tive and clear. We have also improved the usability for 
new and less experienced users by providing more 
examples, help pages, and documentation. The general 
design of selecting one target sequence for the search 
for multiple query sequences has been retained. Next,
the experienced user can adjust many of the Scipio variables,
and, also at this stage, many of the parameters for
searches for alternative exons (those parameters are
described elsewhere). We provide some default values
for cross-species searches that are based on our experience
in working with and knowledge about eukaryotic
genomes [11]. For example, some genomes are known
to contain only small numbers of introns while others
are known to contain only short introns. Special settings
for cross-species searches are provided for several specific
taxa, but the default cross-species parameters should
be applicable for most genomes. Having selected a specific
set of parameters, every single parameter can still
be adjusted individually.

Figure 1 The extended Scipio workflow. This diagram depicts the activity and data flow of a Scipio run. Scipio needs a protein and a target
genome sequence, both in FASTA format, as input to start a Blat run. Every single Blat hit is subsequently processed and filtered, and assembled
in the case of hits on multiple targets. The gap_length describes the number of amino acids of an unmatched query subsequence. The
intron_length is the corresponding length of the unmatched target subsequence in nucleotides.

As before, the most important result view is the 
scheme of the exon-intron structure of the search result. 
In this scheme, all information regarding the quality of 
the result (complete versus incomplete, containing gaps, 
i.e. unmatched parts of the query sequence, questionable 
introns, mismatches, frame-shifts, in-frame stop-codons, 
etc.) is included. Opening the "Search details" box pro- 
vides further information concerning the search para- 
meters, and additional data regarding the aligned query 
sequence is available from the different result views. 

Due to gene and whole genome duplications during 
eukaryotic evolution there are often two or more closely 
related homologs of a certain protein per genome. This 
might cause some problems for cross-species searches if 
the paralogs in the target genome are about equally homologous
to the query sequence. Therefore, we implemented
a -multiple_results parameter. Switching -multiple_results
off is the best way to get the exact gene structure for
an intra-species search. Switching -multiple_results on
(default setting in cross-species searches) allows retrieving
all possible results depending on the general search parameters
(like -min_score or -min_identity, Figure 2). If
multiple hits are found they will be listed separately and
can be analysed using the various result views. In addition,
we implemented a quick view showing the gene structure
schemes as a fast overview. As an example of the benefit and
limitation of this parameter, we searched for class-II myosin
heavy chain homologs in humans (Figure 2). It is
known that vertebrate genomes contain several muscle
myosin heavy chain genes (belonging to the class-II myosin
heavy chains) that are specialised for certain tissues
like heart muscles or skeletal muscles [11]. Six of these
genes are encoded in a cluster [36]. The example search
shows the gene structure corresponding to the query
sequence (HsMhc1_fl) and the gene structures of six
homologs of varying degree of divergence. While the closest
homolog (HsMhc1_fl_(1)) only contains mismatches
compared to the query sequence, the three next closest
homologs have severely deviating gene structures. They
contain very long introns in the middle of the genes,
indicating that they are mixed genes assembled from the
N-terminal half of one gene of the muscle myosin
heavy chain cluster and the C-terminal half of the
following gene of the cluster. The next two homologs are
already so divergent that parts of the genes cannot be
reconstructed, leaving many long gaps.

Use of WebScipio to produce publication-quality figures 
of gene structures 

WebScipio can be used to easily produce publication-quality
figures of gene structures. These figures
can either be produced in the described way, or the user can
upload their own genomic DNA sequence for use as target
sequence. This is useful when only the genomic
sequence of a certain region is known rather than the whole
genome sequence. SVGs can be downloaded
and further processed in many graphics programs. 

New general and expert search parameters 

The parameters -min_score (previously: -best_size),
-min_identity, and -max_mismatch have already been
described [15] and define the threshold for the Blat hits
to be processed by Scipio. To reduce or even abolish
the artificial assembly of contigs that by chance contain
some identical residues we have introduced the parameter
-min_coverage that applies to every single Blat
hit. The coverage is the number of mapped residues (as
match or mismatch) divided by the query length of the
(possibly partial) hit. By default, Scipio rejects hits with
a coverage of less than 60%.
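The coverage filter described above amounts to a simple ratio test. The following Python sketch illustrates it under the assumption that the counts of matches and mismatches and the query length of the hit are already known; the function and argument names are hypothetical and do not correspond to Scipio's internal code.

```python
def hit_coverage(matches, mismatches, query_hit_length):
    """Fraction of the query part of a hit that is mapped (match or mismatch)."""
    return (matches + mismatches) / query_hit_length

def passes_min_coverage(matches, mismatches, query_hit_length, min_coverage=0.6):
    """Keep a hit only if its coverage is not below the threshold (default 60%)."""
    return hit_coverage(matches, mismatches, query_hit_length) >= min_coverage
```

With these definitions, a hit spanning 100 query residues with 45 matches and 10 mismatches has a coverage of 0.55 and would be rejected at the default threshold.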

In addition to these general parameters we have intro- 
duced several expert options most of which will be 
described in detail below. One of the parameters is 
-transtable that allows the user to specify a non-stan- 
dard translation table, for the use with species like Can- 
dida species, Tetrahymena thermophila and others that 
would otherwise lead to mismatches. Another parameter
called -accepted_intron_penalty is used to define valid
splice sites. By default, GT-AG and GC-AG are
accepted, whereas, for example, introns with the pattern
AT-AC would be classified as doubtful ("intron?"). By
adjusting the -accepted_intron_penalty parameter, such
introns will also be accepted instead of being marked as
"intron?".
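One way to picture the effect of such a threshold is sketched below in Python. The penalty tables and the default threshold of 1.0 are purely hypothetical values chosen so that GT-AG and GC-AG are accepted and AT-AC is not; they are not Scipio's actual numbers, and the function names are illustrative.

```python
# Hypothetical splice site penalties (not Scipio's actual values).
DONOR_PENALTY = {"GT": 0.0, "GC": 1.0}
ACCEPTOR_PENALTY = {"AG": 0.0}
OTHER_SITE_PENALTY = 2.0

def splice_site_penalty(intron):
    """Sum of donor and acceptor penalties for an intron sequence."""
    seq = intron.upper()
    donor = DONOR_PENALTY.get(seq[:2], OTHER_SITE_PENALTY)
    acceptor = ACCEPTOR_PENALTY.get(seq[-2:], OTHER_SITE_PENALTY)
    return donor + acceptor

def label_intron(intron, accepted_intron_penalty=1.0):
    """Return "intron" if the splice sites are acceptable, else "intron?"."""
    if splice_site_penalty(intron) <= accepted_intron_penalty:
        return "intron"
    return "intron?"
```

Raising the threshold in this sketch makes introns with unusual splice sites, such as AT-AC, count as accepted.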

Parameters to account for additional/missing bases in 
predicted exons 

Gene homologs even from very closely related species 
are often too divergent to be completely identified by 
Blat. While the core building block of the proteins and 
the functional sites are often strongly conserved, low 
homology is especially found at the surface of the proteins.
Thus, loop regions are often sites of amino acid
substitutions, insertions of long stretches of residues,
and deletions. In addition, since the terminal regions of
most proteins are at the surface, they are also often very
divergent. Short stretches of nucleotides whose lengths
are multiples of three and whose translations do not
result in any in-frame stop codons are most likely to be
insertions rather than true introns.

Figure 2 Screenshot of the multiple results view of WebScipio. The screenshot shows the result of the search for multiple homologs of one
of the muscle class-II myosins from human in the human genome. The search parameters were -min_identity = 60%, -max_mismatch = ∞, and
-multiple_results = yes to get as many homologs as possible. On top, the opened quick view of all reconstructed gene structures is shown.
Next, a panel with the different results is shown. Green numbers mark complete results (100% of the query sequence reconstructed) while red
numbers mark incomplete results (might contain gaps, mismatches, frameshifts, etc.). Result hit number 2 was selected and shows the result for
the closest homolog to the query sequence with no gaps (unmapped query sequence) but 101 mismatches.

A parameter -min_intron_len has been implemented
to distinguish introns from insertions, with a default 
minimum intron length of 22 nucleotides. A minimum 
intron length of 22 nucleotides is a rather conservative 
estimate given the minimum intron length of 35-40 
nucleotides based on a test set of about 17,000 introns 
of genes of 10 model organisms [37]. Thus, by default 
additional coding sequence for up to seven amino acids 
(=21 nucleotides) will be treated as exon sequence and 
joined with the surrounding exons into a single exon. 
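The decision rule described in the two preceding paragraphs can be sketched in Python as follows. This is an illustration only: it assumes the standard stop codons, assumes that the extra target sequence starts at a codon boundary, and uses a hypothetical label "unresolved" for stretches that are neither long enough for an intron nor a clean in-frame insertion. The function name is not part of Scipio.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumed)

def classify_extra_target_sequence(seq, min_intron_len=22):
    """Classify additional target nucleotides between two mapped exon parts."""
    seq = seq.upper()
    if len(seq) >= min_intron_len:
        return "intron"
    codons = [seq[k:k + 3] for k in range(0, len(seq), 3)]
    if len(seq) % 3 == 0 and not any(c in STOP_CODONS for c in codons):
        return "insertion"   # treated as exonic sequence and joined with its neighbours
    return "unresolved"      # e.g. a possible frameshift or in-frame stop codon
```

With the default of 22 nucleotides, a 21-nucleotide stretch without stop codons (seven codons) is still treated as exon sequence, matching the limit of seven amino acids stated above.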

The opposite case of extra amino acids in the query is 
dealt with by the parameter -gap_to_close. By default, a 
mapping of up to six additional amino acids from the 
query sequence to the exon borders will be enforced at 
the cost of further mismatches, in order to eliminate a 
gap (of unmatched query sequence). This parameter also 
affects the modelling of the intron borders (see below).
Figure 3 shows two examples of cross-species searches in
which the target sequence contains additional or fewer
amino acids in conserved exons. Case A shows the results 
of a search for a kinesin homolog from Neurospora 
crassa (query sequence) in the closely related organism 
Neurospora discreta (target sequence, see also Additional 
file 2). Because of the relatively high homology of the two 
sequences, Blat has already retained the additional resi- 
dues of the query sequence so that they are included in 
the result of the old Scipio version. However, a question- 
able intron (called intron? in Scipio) was introduced in 
the region that contained additional nucleotides in the 
target sequence leading to missing residues in the target 
translation. With the new parameter -min_intron_len
these additional nucleotides are correctly treated as exo- 
nic sequence. Case B shows an example of two divergent 
homologs of the dynactin p62 gene of Phytophthora 
ramorum (query sequence) and Phytophthora sojae (tar- 
get sequence, see also Additional file 2). These two 
homologs contain a long divergent region with many 
consecutive mismatches in the first exon that is not iden- 
tified by Blat and introduces a long gap of unmatched 
residues. In addition, the N- and C-termini have diver- 
gent sequences and different lengths. With the new para- 
meters, Scipio can correctly model the target gene. 

Parameters to identify divergent exons and very short 
exons ignored by Blat 

To identify exons that contain too many mismatches to be
identified by Blat, and to correctly annotate very short
exons, the Needleman-Wunsch algorithm described above
forces an alignment of unmatched query sequence to
spare target sequence. Very short exons of one to four
amino acids are only reconstructed if they are identical to
the query sequence and contain valid splice sites, while
short exons of five to seven amino acids are also often
correctly reconstructed if they contain mismatches between
query and target sequence (e.g. in cross-species searches).
The maximal lengths of the query and target sequence fragments
to be aligned with Needleman-Wunsch are controlled
by the parameters -exhaust_gap_size and
-exhaust_align_size, respectively. By default, the exhaustive
search is restricted to query gaps of 21 amino acids
(three times the default Blat tilesize), since we expect Blat
to successfully discover at least parts of any longer exons,
and to a target subsequence of 15,000 bps. The restriction
of the latter value is caused by the exponentially increased
run time with increased target subsequence so that, for
example, the potentially very long introns in mammalian
genomes are only searched after manual increase of this
value. Other parameters affecting the Needleman-Wunsch
algorithm, such as the penalties mentioned above, can be
adjusted by the command line version only, and not via
WebScipio. However, the default values have extensively
been tested with in-house data and should not require
changes in most if not all cases.
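The two size limits can be read as a simple gate in front of the exhaustive search. The Python sketch below illustrates this reading; the function name is hypothetical, the argument names follow the gap_length and intron_length terms of Figure 1, and the use of inclusive limits is an assumption.

```python
def exhaustive_search_allowed(gap_length, intron_length,
                              exhaust_gap_size=21, exhaust_align_size=15000):
    """Decide whether an unmatched region is small enough for the exhaustive search.

    gap_length:    unmatched query subsequence in amino acids
    intron_length: corresponding unmatched target subsequence in nucleotides
    """
    return (0 < gap_length <= exhaust_gap_size
            and intron_length <= exhaust_align_size)
```

In this reading, a 3 amino acid gap opposite a 40,000 bp target region would only be searched after raising the target limit accordingly.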

The effect of the new parameters on the search results 
is demonstrated with the examples shown Figure 4 (see 
also Additional file 2). In case A, the human dynactin 
p50 gene contains two very short exons of 3 and 2 
amino acids. These two short exons are conserved in all 
vertebrates (B. Hammesfahr and M. Kollmar, unpub- 
lished data). Case B shows the coronin gene from the 
basidiomycote fungi Puccinia graminis encoding a short 
3 amino acid exon (Figure 4, see also Additional file 2). 
In addition, the codons at the exon/intron junctions of 
this short exon are split. In most of the other basidio- 
mycotes sequenced so far, this short exon is part of one 
of the neighbouring exons, or part of a longer exon that 
includes both neighbouring exons. However, it also 
exists in the basidiomycote Melampsora laricis-popu- 
lina. Thus, this short exon is not an artificial creation 
but a true exon. Case C presents the dynactin pl50 
gene that contains three short exons of 7, 6, and 7 
amino acids at the beginning of the gene (Figure 4, see 
also Additional file 2). Even with the Blat-tilesize set to 
5 those exons are not recognized in the search against 
the chromosome assembly. This example best demon- 
strates the effect of the -exhaust_align_size (default set- 
ting 15,000 bps) and the -exhaust_gap_size (default 
setting 21 aa) parameters to completely reconstruct the 
respective part of the gene. At the 3'-end of the p150 
gene, there is another very short exon that shows some 
homology to the beginning of the preceding intron and 
is therefore added to the 3'-end of the preceding exon 

Figure 3 Modelling of additional/missing bases in gene predictions. Case A shows the result of the search of a kinesin from Neurospora 
crassa (query sequence) in Neurospora discreta (target sequence) using the old and the new Scipio version. The -min_intron_len parameter has 
been set to 22. Case B shows the result of a search of the dynactin p62 homolog from Phytophthora ramorum (query sequence) in Phytophthora 
sojae (target sequence). To get the correct gene prediction the following Scipio parameters have been used: -min_identity = 60%, -min_score = 
0.3, -max_mismatch = ∞, -gap_to_close = 15, -min_intron_length = 22. The colour coding is explained in the legend and applies to all gene 
structure figures. For further information see Additional file 2. 

Figure 4 Reconstruction of very short exons. Case A shows the result for the reconstruction of the human dynamitin (dynactin p50) gene, 
which contains a 3 amino acid exon and a following 2 amino acid exon that are differentially included in the final transcript. These exons could 
not be reconstructed with Blat and the old Scipio version, but could be with the new Scipio version that enables Needleman-Wunsch searches. The 
-exhaust_align_size parameter has been set to 15,000 bp because of the length of the intron. Case B shows the result of the reconstruction of 
the coronin gene from Puccinia graminis f. sp. tritici. The small but evolutionarily conserved exon 7 can now correctly be reconstructed. Case C 
shows the result of the reconstruction of the mouse dynactin p150 gene that contains three short exons of 7, 6, and 7 amino acids close to the 
5'-end of the gene. For the correct reconstruction, the -exhaust_align_size parameter has been increased to 10,000 bp, because of the length of 
the intron, and the -exhaust_gap_size has been set to 21 because of the length of the query that could not be mapped. The colour coding of 
the scheme is the same as in Figure 3. For further information see Additional file 2. 

although this results in some mismatches. This behaviour 
has also been corrected in the new Scipio version 
by some other parameters (see below). 

Genes might not only contain very short exons 
between other exons but also at gene termini. Scipio 
uses an exact pattern search for N-terminal and C-term- 
inal exons. Terminal exons will only be accepted if they 
match the query sequence and if the resulting intron 
borders agree with the two most common splice site 
patterns (GT-AG and GC-AG). The length of the 
terminal exons searched for is limited by the -gap_to_close 
parameter that is by default six residues. 
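
The splice site condition for accepting a terminal exon can be sketched as a simple check on the intron that the candidate exon would create. This is only an illustration under stated assumptions: the function name and the constant are hypothetical, and the exact pattern search for the exon itself is not shown.

```python
# The two splice site patterns named in the text: (donor, acceptor).
CANONICAL_SPLICE_SITES = {("GT", "AG"), ("GC", "AG")}


def has_canonical_splice_sites(intron):
    """Return True if the intron starts with GT or GC and ends with AG.

    `intron` is the nucleotide sequence between two exons, given on the
    coding strand. Hypothetical helper for illustration only.
    """
    intron = intron.upper()
    if len(intron) < 4:
        # Too short to carry a separate donor and acceptor dinucleotide.
        return False
    return (intron[:2], intron[-2:]) in CANONICAL_SPLICE_SITES


if __name__ == "__main__":
    for candidate in ("GTAAGTTTTCAG", "gcaagtttag", "ATAAGTTTAC"):
        print(candidate, has_canonical_splice_sites(candidate))
```

A candidate terminal exon would then be accepted only if it matches the query exactly and this check succeeds for the resulting intron.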

Parameters to account for low homology at intron 
borders 

The correct prediction of exact intron borders is one of 
the most difficult tasks in protein-based gene prediction, 
especially for those intron borders next to small exons, 
because their residues might be falsely assigned to 
neighbouring exons, or when homology is low, as in 
cross-species applications. Here, divergent residues at 
intron borders are often not recognized by Blat, or 
conversely, intronic sequence is falsely assigned to the exon. 
To deal with the latter case, Scipio cuts off the marginal 
parts of Blat matches and realigns them. The parameter 
-max_move_exon allows increasing the default value of 
six residues that are cut off from the marginal parts. 
Figure 5 shows the effect of this parameter in some 
representative examples (see also Additional file 2). In 
the case of the human class-19 myosin gene, Blat and 
the old Scipio version were not able to reconstruct the 
5'-end of the gene correctly, because the intron in front 
of the second exon of the gene ends with the translated 
sequence LFQ that is very homologous to the real 

Figure 5 Reconstructing short exons at low homology intron borders. The scheme shows two examples for the reconstruction of short 
exons in regions where the intron borders of the neighbouring exons show some homology to the unmatched query sequence. The value for 
the -max_move_exon parameter has been set to 3 (case A) and 6 (case B), respectively. The colour coding of the scheme is the same as in 
Figure 3. For further information see Additional file 2. 

sequence LQQ. Blat added these residues to exon 2 
albeit introducing a mismatch. With the new parameter 
-max_move_exon (default setting is 6), Scipio is now 
able to resolve this misalignment and to subsequently 
identify the correct exon 1. Case B shows the reconstruction 
of the actin capping protein α from Thielavia 
heterothallica (Figure 5, see also Additional file 2). Here, 
by chance the intergenic region before exon 3 shows 
some homology to exon 2 (3 matches and 3 mismatches) 
and thus the exon 2 sequence was erroneously 
joined to exon 3. This happened irrespective of lowering 
the Blat tilesize or adjusting any of the other Scipio 
parameters. By setting the -max_move_exon to 6 
(default setting), the new version of Scipio is now able 
to correctly reconstruct the CAPα gene. 

Parameters to adjust searches on chromosomes or highly 
fragmented data 

Scipio is able to reconstruct genes that are spread on sev- 
eral contigs or supercontigs of highly fragmented gen- 
omes. As we have shown, this feature is one of the most 
important strengths of Scipio [14] that other programs 
do not offer. However, this feature is not needed in chro- 
mosomal assemblies, and might lead, especially in the 
case of cross-species searches, to composed hits that 
stretch across multiple chromosomes, one of them being 
a false positive (Figure 6). Hence, it can be switched off 
with the parameter -single_target_hits (or -chromosome), 
which is the default setting when selecting a 
chromosome assembly as genome target file in WebScipio. 
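
The effect of restricting a composed hit to one target can be sketched as follows. How Scipio chooses the target internally is not described here, so selecting the target with the highest summed score is an assumption, as are the function name and the dictionary keys.

```python
from collections import defaultdict


def restrict_to_single_target(partial_hits):
    """Keep only the partial hits located on the best-scoring target.

    Each hit is a dict with the hypothetical keys 'target' (sequence
    name) and 'score'. Summing scores per target is an assumption made
    for this sketch, not Scipio's documented behaviour.
    """
    score_per_target = defaultdict(float)
    for hit in partial_hits:
        score_per_target[hit["target"]] += hit["score"]
    if not score_per_target:
        return []
    best_target = max(score_per_target, key=score_per_target.get)
    return [hit for hit in partial_hits if hit["target"] == best_target]


if __name__ == "__main__":
    hits = [
        {"target": "chr3", "score": 950.0},
        {"target": "chr3", "score": 40.0},
        {"target": "chr11", "score": 4.0},  # short spurious match elsewhere
    ]
    print(restrict_to_single_target(hits))
```

With such a filter, a short identical stretch on a different chromosome can no longer be appended to the gene.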



For highly fragmented genomes it is still useful to 
allow gene reconstructions across several contigs. But 
also in this case one would want to exclude the assem- 
bly of hits that would introduce extremely long introns 
between exons on different contigs. To account for 
those cases we have introduced the -max_assemble_size 
parameter that adjusts the maximum size of intron parts 
at target boundaries. If an intron would have to be created 
between two partial hits across two contigs that 
exceeds the given size (default: 75000 nucleotides), the 
two hits will not appear together as parts of one composed 
hit; rather, the lower-scoring contig will be discarded 
unless -multiple_results is enabled. Alternatively, 
the parameter -min_dna_coverage can be used to limit 
the length of introns stretching across contig bound- 
aries, by specifying a minimum query/target length ratio 
for composed hits, in percent. 
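
The size test behind -max_assemble_size can be sketched as below. This is an illustration under stated assumptions: the function name and dictionary keys are hypothetical, both hits are taken to lie on the forward strand, and adding up the two intron parts before comparing them with the limit is an assumption about how the limit is applied.

```python
def can_assemble(upstream_hit, downstream_hit, max_assemble_size=75000):
    """Decide whether two partial hits on different contigs may be joined.

    upstream_hit:   dict with 'contig_length' and 'hit_end' (1-based,
                    inclusive end of the match on its contig)
    downstream_hit: dict with 'hit_start' (1-based start of the match)

    The intron between the hits would consist of the unmatched tail of
    the upstream contig plus the unmatched head of the downstream contig.
    """
    tail = upstream_hit["contig_length"] - upstream_hit["hit_end"]
    head = downstream_hit["hit_start"] - 1
    return tail + head <= max_assemble_size


if __name__ == "__main__":
    first = {"contig_length": 120000, "hit_end": 119000}
    second = {"hit_start": 500}
    print(can_assemble(first, second))  # intron parts: 1000 + 499 nt
```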

Improved gene structure reconstruction in cross-species 
searches 

To test the sensitivity and specificity of the new Scipio 
version we performed a cross-species search of the 
dynein heavy chain (DHC) genes of Homo sapiens in 
Loxodonta africana. The dynein heavy chain genes have 
been chosen because they belong to the longest genes in 
eukaryotic genomes and thus contain many exons 
spread on several hundred thousands of base pairs 
(Table 1). In addition, the dynein heavy chain family 
members show different degrees of identity in mammals 
and are therefore very suitable to test the limits of 

Figure 6 Reconstructing genes on chromosome assemblies. The scheme shows an example of the search for the rat homolog (target 
sequence) of the human Kif5C kinesin motor protein. The C-terminal about 25 amino acids of the rat Kif5C homolog are missing in the 
respective chromosome assembly. Using Scipio v1.0 a very short identical stretch of four amino acids, found on a different chromosome, has 
artificially been added to the 3'-end of the gene generating an "intron" of millions of base pairs (Note the scale of the introns!). The new 
parameter -single_target_hits now prevents this mis-assembly. The colour coding of the scheme is the same as in Figure 3. 

Table 1 Details of the dynein heavy chain genes used for the cross-species search 

Protein   Homo          Loxodonta     Status*   Loxodonta     Loxodonta 
name      length [aa]   length [aa]             length [bp]   exons 
DHC1      4646          4561          P         62248         77 
DHC2      4307          4234          P         413085        89 
DHC3A     4707          4690          ✓         358204        92 
DHC3B     4624          4582          P         298827        78 
DHC4A     4507          4508          ✓         361115        82 
DHC4B     4462          4428          P         115251        79 
DHC4C     4486          4339          P         486267        69 
DHC5      4589          4584          ✓         140290        79 
DHC6      4509          4457          P         134640        86 
DHC7A     4024          4019          ✓         253605        62 
DHC7B     4070          3966          P         187156        60 
DHC7C     3960          3960          ✓         201577        73 
DHC8      4265          4064          F         80895         73 
DHC9A     4158          4062          P         302568        75 
DHC9B     4612          4597          P         369531        85 
DHC11     4779          4779          ✓         81205         43 

* Status: P = partial sequence (short part of the sequence missing); F = 
sequence fragment (large region of the gene missing in the genome); ✓ = 
sequence complete. 

Scipio. Afrotheria (to which the elephants belong) and 
the Euarchontoglires separated about 100 million years 
ago [38]. The DHC query sequence test set and the 
longer divergence time between the species should provide 
a more demanding test of the cross-species search 
capabilities of Scipio than the cross-species search of 
human myosin heavy chain genes in the mouse genome 
that we performed earlier [15]. 

Figure 7 shows some example results of the cross-species 
search with genes of decreasing identity. The class-1 
dynein heavy chain genes (DHC1) are very conserved 
between mammals, and the Loxodonta DHC1 could perfectly 
be reconstructed (except for the N-terminus that 
is not covered in the genome assembly). The DHC4A 
protein of Loxodonta has about 88 percent identity to 
the human homolog, and could also completely be 
reconstructed. In contrast, the DHC9B protein has only 
about 78 percent identity to the human homolog and 
the reconstructed gene still contains several gaps. The 
figure shows the result of the search using the old Scipio 
version compared to the result of the search with the 
new Scipio version. As reference, the result of the man- 
ual annotation of the gene is shown. It is very obvious 
that the new Scipio version provides a dramatically 
improved reconstruction of the Loxodonta DHC9B gene. 
More than 1,000 additional residues could be mapped 
corresponding to an increase in completeness by about 
25 percent. The number of reconstructed exons 
increased from 62 to 80, which is close to the optimally 
reconstructed number of 85. The diagrams in Figure 8 



show the improvements in gene reconstruction of the 
new Scipio version compared to the old version for the 
complete DHC dataset (see also Additional file 3). The 
reference for the perfectly reconstructed gene is the 
manual annotation based on the comparative annotation 
of more than 2,000 dynein heavy chain genes. The basis 
in diagram A is the reconstruction with Scipio v1.0, and 
shown are the improvements in the completeness of the 
annotation with Scipio v1.5 using different search parameters. 
In general, with the new Scipio version the 
reconstructions in these cross-species searches could 
considerably be improved. Lowering the tilesize, a Blat 
parameter to search with smaller fragments, further 
improved the results in only two cases. This corre- 
sponds to improvements independent of Scipio. How- 
ever, extending the search frame for the exon search 
with the Needleman-Wunsch algorithm (parameter 
-exhaust_align_size) further completed the reconstruc- 
tion in almost all cases demonstrating the effect of the 
newly introduced Needleman-Wunsch search for short 
or divergent exons. 

Comparison of gene reconstruction and prediction tools 

We compared Scipio to other tools that reconstruct and 
predict genes based on a protein sequence, and to gen- 
eral gene prediction tools. The tools can be ordered in 
three categories. The tools of the first category recon- 
struct the exon-intron structure of the protein-coding 
genes based on a genomic sequence and a provided protein 
sequence. Scipio [14], Prosplign [39], Exonerate 
[40], and Prot_map [41] belong to this category. The 
second category includes the tools Fgenesh+ [42], 
GeneWise/Wise2 [43], and GenomeScan [44], that combine 
homology based gene reconstructions taking advantage 
of given protein sequences, and ab initio gene prediction 
approaches. The third group of software packages con- 
sists of ab initio gene prediction tools like Augustus 
[45], Fgenesh [42], and Genscan [46]. The latter tools 
are not really comparable with the other ones in the 
task of reconstructing single genes, but the comparison 
illustrates the differences of ab initio and homology 
based gene predictions. In addition to Blat, we tested 
Blast [47], which can also be used as an initial search 
for the Prosplign tool. However, for our test cases this 
approach did not improve the results of Prosplign (see 
Additional file 4). 

To evaluate the performance of Scipio in comparison 
to the other tools, four test scenarios have been 
designed. The DHC proteins have been chosen as a 
large general test set, while the other examples used for 
the explanation of the new Scipio parameters have been 
used as a test set for genes difficult to reconstruct. Both 
test data sets have been explored in reconstructions/pre- 
dictions out of whole genome assemblies and respective 

[Figure 7: gene structure cartoons. Statistics given below the cartoons:] 
LaDHC4A: Protein query length: 4507 aa; Matches: 3993; Mismatches: 513; Exons: 82; Introns: 81; Intron?: 0; Gaps: 0. 
LaDHC9B, Scipio v1.0: Protein query length: 4612 aa; Matches: 2630; Mismatches: 618; Exons: 62; Introns: 32; Intron?: 4; Gaps: 27. 
LaDHC9B, Scipio v1.5: Protein query length: 4612 aa; Matches: 3288; Mismatches: 895; Exons: 80; Introns: 64; Intron?: 2; Gaps: 14. 
Reference reconstruction of the DHC9B gene from Loxodonta africana (Scipio v1.5): Protein query length: 4597 aa; Matches: 4597; Mismatches: 0; Exons: 85; Introns: 83; Intron?: 1; Gaps: 0. 

Figure 7 Example cross-species searches. The results of four searches with dynein heavy chain sequences from Homo sapiens in the elephant 
(Loxodonta africana) genome are shown. All genes are spread on several hundred thousands of base pairs. Statistics to the sequence results are 
given below the gene structure cartoons. An "intron?" is an intron for which the borders do not correspond to the standard splice sites GT-AG 
or GC-AG. The colour coding of the scheme is the same as in Figure 3. 




gene regions. This differentiation has been done because 
only a few of the above-mentioned tools could be used 
in searches against whole genomes due to the limited 
upload possibilities of the respective web-interfaces 
while command-line versions of the tools were not 



available for every software. Thus we tested the perfor- 
mance of all tools against the gene regions of the test 
data that correspond to the nucleotide sequence of the 
reference annotation plus 2,000 additional base pairs 
up- and downstream. To make the execution times 

Figure 8 Diagrams of the improvements introduced with the new Scipio version. The diagrams describe the improvement of the gene 
reconstructions of the DHC genes in the cross-species search of the human homologs (query sequences) in elephant (target sequence) using 
different Scipio versions and parameters. (A) The base-line is the result of the search using the old Scipio v1.0. The maximal possible annotation 
is represented by the gene reconstructions based on the manually annotated elephant DHC genes (reference dataset, purple). The blue bars 
show the reconstruction with Scipio v1.5 using -blat_tilesize = 7, -exhaust_align_size = 500 and -exhaust_gap_size = 21 (dataset s1). Green bars 
are results from the second search (dataset s2) with same parameters as for the first search, except for -blat_tilesize = 6 and -exhaust_gap_size 
= 18 (three times the tilesize). This dataset represents improvements independent of Scipio. The red bars represent searches with same 
parameters as for dataset s1, except for the increased parameters -exhaust_align_size = 5,000 and -exhaust_gap_size = 25 (dataset s4). This data 
takes far longer to compute compared to the first search, because of the Needleman-Wunsch search in longer regions. For the DHC1 gene 
Scipio v1.0 maps too many amino acids of the human query sequence to the elephant genome. So the negative bar representing the other 
datasets shows that these datasets cover the right number of 4561 amino acids. (B) This diagram depicts the number of gaps (human query 
sequence not matched in the elephant genome) and questionable introns (intron?; introns with uncommon splice sites) for the searches with 
the old Scipio version and the new version applying different parameters as in (A). The detailed values of the diagrams are shown in tables in 
Additional file 3. 

comparable, the genome-wide runs were performed on a 
dedicated server, which contains four 2.2 GHz AMD 
Opteron 6174 processors, with 12 cores each, and 128 
GByte of memory. 
Scenario 1 

In the first scenario, the tools had to reconstruct the 
dynein heavy chain genes in the whole Loxodonta africana 
genome assembly based on the human protein 
sequences (Table 2). Besides Scipio, only Exonerate and 
Augustus were able to produce reasonable results. 
Prot_map, Fgenesh, and Fgenesh+ could not be tested 
in this scenario because the command-line versions are 
proprietary and it is not possible to upload whole genome 
sequences via their web-interfaces. WebScipio is 
the only tool available that already provides genome 
sequences. The dynein heavy chain genes contain 1,202 
annotated exons comprising 209,486 nucleotides. The 
Loxodonta africana genome contains 3,271,792,967 
nucleotides including N's. For the DHC1 gene the 
N-terminus cannot be found in the genome sequence 
because of a gap in the genome assembly. We adjusted 
the start of the first known exon in the reference annotation 
to the predicted exon for each tool, because the 
start depends on whether a tool found an exon in front 
of the first known exon. The results of the first test scenario 
are presented in Table 2 (for more data see Additional 
file 4). Scipio and Exonerate in the standard 
mode are comparable in exon sensitivity (93.4% and 
94.8%, respectively) and missed a similar number of 
exons (11 exons and 6 exons, respectively). However, 
Exonerate predicted many wrong exons (5669 exons) 
resulting in a low specificity (16.5%, compared to 93.3% 
exon specificity by Scipio). Exonerate can be configured 
to report only the best hit by setting the -bestn option 
to 1. While this option increased the specificity (from 
16.5% to 90.2%), the sensitivity decreased (from 94.8% 
to 73.4%). Also, the number of missing exons increased 
to 287. 
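
The overlap-based measures reported in Table 2 (missing exons, wrong exons, and the overlap-based exon sensitivity) follow directly from the definitions in the table footnotes. A minimal sketch, assuming exons are given as closed (start, end) intervals on one sequence; the function names are hypothetical:

```python
def overlaps(a, b):
    """True if two closed intervals (start, end) share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def overlap_statistics(annotated, predicted):
    """Return (missing exons, wrong exons, overlap-based exon sensitivity in %).

    missing: annotated exons not overlapped by any predicted exon
    wrong:   predicted exons not overlapped by any annotated exon
    """
    if not annotated:
        raise ValueError("at least one annotated exon is required")
    missing = sum(
        1 for a in annotated if not any(overlaps(a, p) for p in predicted)
    )
    wrong = sum(
        1 for p in predicted if not any(overlaps(p, a) for a in annotated)
    )
    sensitivity_ov = 100.0 * (len(annotated) - missing) / len(annotated)
    return missing, wrong, sensitivity_ov


if __name__ == "__main__":
    annotated = [(100, 200), (300, 400), (500, 600), (700, 800)]
    predicted = [(90, 210), (310, 390), (900, 950)]
    print(overlap_statistics(annotated, predicted))
```

As a consistency check on the reported numbers, 11 missing exons out of 1,202 annotated exons give (1202 - 11) / 1202, or about 99.1%, which is the overlap-based exon sensitivity listed for Scipio 1.5.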

Comparing the results of Scipio and Blat illustrates 
that Blat found almost all exons, but that Scipio is 
needed to refine the exon borders as well as to exclude 
hits not related to the query sequence. Using the new 
Needleman-Wunsch algorithm Scipio v1.5 closes many 
gaps by adding and extending exons to the hits found 
by Blat. The number of missing exons is lower in Blat 
(9 exons missing) than in Scipio (11 exons missing), 
because Blat maps parts of the protein sequence to the 
genomic sequence, although these hits are not in the 
same order as in the protein sequence. Scipio excludes 
these hits. The results also show the great improvement 
of Scipio v1.5 compared to Scipio v1.0 in sensitivity 
(93.4% and 86.1%, respectively) and specificity (93.3% 
and 83.2%, respectively). Altogether, these results show 
that Scipio v1.5 is the only free tool that is able to 
reconstruct the genes nearly completely in this scenario. 
Scenario 2 

The results of the second scenario are shown in Table 3. 
All above-mentioned tools were compared except for 
GenomeScan. Although GenomeScan produced results 
with the data provided on the respective webpage it did 
not work with our protein examples. The data show 
that Scipio performed in the same range as the other 
tools with respect to sensitivity and specificity. Scipio, 
Prosplign, and Exonerate revealed the highest sensitivity 
(94.7%, 95.7%, and 94.8%, respectively). Although Pros- 
plign missed only one exon it also mis-predicted 41 
exons. The homology based ab initio tools Fgenesh+ 
and Wise2 also provided almost complete 



Table 2 Test scenario 1: Reconstruction of the Loxodonta africana dynein heavy chain genes in the whole genome 
sequence based on human protein sequences 

Tool            Predicted  Missing    Wrong      Exon     Exon sens.   Exon     Nucl.    Nucl.    Execution time 
                genes      exons (a)  exons (b)  sens. %  (ov.) (c) %  spec. %  sens. %  spec. %  per prot. seq. 
Scipio 1.5 (d)  16         11         6          93.4     99.1         93.3     98.7     99.8     70 m 46 s 
Exonerate (e)   2145       6          5669       94.8     99.5         16.5     99.7     18.5     123 m 23 s 
Exonerate (f)   16         287        62         73.4     76.1         90.2     76.1     94.3     121 m 27 s 
Augustus (g)    1374928    0          3909434    47.9     100.0        0.0      100.0    0.3      > 10 days 
BLAT (h)                   9          264228     19.6     99.3         0.1      97.4     2.6      7 m 24 s 
Scipio 1.0 (d)  16         14         46         86.1     98.8         83.2     97.9     99.4     8 m 24 s 

(a) Number of annotated exons, which are not overlapped by any predicted exon 
(b) Number of predicted exons, which are not overlapped by any annotated exon 
(c) Number of annotated exons, which are overlapped by at least one predicted exon, divided by the number of annotated exons 
(d) Mammalia cross species default options (for detailed parameters see Additional file 5) 
(e) Parameters: -model protein2genome 
(f) Parameters: -model protein2genome -bestn 1 
(g) Parameters: -species = human -genemodel = exactlyone (for more parameters see Additional file 5) 
(h) Parameters as in (d): -tileSize = 7 -minIdentity = 54 -minScore = 15 -oneOff = 1 



Hatje et al. BMC Research Notes 2011, 4:265 
http://www.biomedcentral.com/1756-0500/4/265 






Table 3 Test scenario 2: Reconstruction of the Loxodonta africana dynein heavy chain gene structures in the 
respective gene regions based on human protein sequences 

Tool            Pred.   Missing    Wrong      Exon     Exon sens.   Exon     Nucl.    Nucl. 
                genes   exons (a)  exons (b)  sens. %  (ov.) (c) %  spec. %  sens. %  spec. % 
Scipio 1.5 (d)  16      13         5          93.1     98.9         93.1     98.6     99.8 
Scipio 1.5 (e)  16      4          7          94.7     99.7         93.7     99.2     99.8 
Prosplign (f)   16      1          41         95.7     99.9         92.6     99.9     98.7 
Exonerate (g)   32      7          6          94.8     99.4         94.6     99.6     99.5 
Exonerate (h)   16      255        4          75.7     78.8         95.6     79.2     99.7 
Prot_map (i)    16      4          27         91.7     99.7         86.2     99.3     99.7 
Fgenesh+ (j)    16      10         10         94.9     99.2         94.8     99.0     99.7 
Wise2 (k)       39      3          16         93.3     99.8         91.2     99.7     98.9 
Augustus (l)    16      132        111        81.9     89.0         83.2     89.9     88.7 
Fgenesh (j)     161     111        342        80.2     90.8         67.3     91.8     62.3 
Genscan (m)     194     138        520        76.3     88.5         57.9     90.4     55.3 
BLAT (n)                16         19         19.9     98.7         19.4     97.0     98.9 
Scipio 1.0 (d)  16      16         10         86.2     98.7         85.9     97.8     99.8 

(a) Number of annotated exons, which are not overlapped by any predicted exon 
(b) Number of predicted exons, which are not overlapped by any annotated exon 
(c) Number of annotated exons, which are overlapped by at least one predicted exon, divided by the number of annotated exons 
(d) Mammalia cross species default options (for detailed parameters see Additional file 5) 
(e) Mammalia cross species default options; -tileSize = 6 (for detailed parameters see Additional file 5) 
(f) Parameters: -full -two_stages 
(g) Parameters: -model protein2genome 
(h) Parameters: -model protein2genome -bestn 1 
(i) Similarity: Weak; Search for one best alignment only (for more parameters see Additional file 5) 
(j) Organism: Human 
(k) Parameters: -both 
(l) Parameters: -species = human -genemodel = exactlyone (for more parameters see Additional file 5) 
(m) Organism: Vertebrate; Suboptimal exon cutoff: 1.00 
(n) Parameters as in (d): -tileSize = 7 -minIdentity = 54 -minScore = 15 -oneOff = 1 



reconstructions. Especially Fgenesh+ achieved high and 
balanced values for sensitivity (94.9%) and specificity 
(94.8%). The number of predicted genes illustrates that 
Exonerate without the -bestn option and Wise2 tend to 
divide long genes (32 and 39 genes predicted, respectively, 
instead of 16). The ab initio tools did not show 
comparable performance to the other tools in this scenario 
resulting in sensitivities of 76 - 82% and specificities 
of 58 - 83%. Augustus outperforms Fgenesh and 
Genscan with (Table 3) or without (see Additional file 
4) the option to predict exactly one gene. Augustus with 
the restriction to predict exactly one gene resulted in 
more accurate reconstructions. As in the whole genome 
scenario, the new Scipio v1.5 (93.1% sensitivity and 
93.1% specificity) provides far better gene predictions 
than Blat and Scipio v1.0 (sensitivity of 19.9% and 
86.2%, and specificity of 19.4% and 85.9%, respectively). 
Scenario 3 and 4 

In the third and fourth scenario we compared the tools in 
their performance to reconstruct the difficult cases, 
which we introduced above by describing the new 
parameters of Scipio v1.5. In scenario 3 a search in the 
whole genome and in scenario 4 a search in the respective 
gene regions (as in scenario 2) was performed. 
Table 4 summarizes the results of the third and fourth 
scenario. Only when using the latest version of Scipio 
could the genes of the test data set correctly be reconstructed 
and predicted in the whole genome assemblies 
as well as in the gene region. None of the other tools 
was able to reconstruct all genes correctly, even if the 
gene region was given as in the fourth scenario. 

Conclusions 

Scipio and its graphical web-interface WebScipio are 
tools for the reconstruction of gene structures in eukar- 
yotes. Scipio is based on the widely used program Blat 
that has been developed for aligning sequences of very 
high similarity. However, for the correct reconstruction 
of intron splice sites, very short exons, genes spread on 
several contigs, and the handling of sequencing errors a 
lot of post-processing is required. This is done by Sci- 
pio. Here, we present the fundamentally updated 

Table 4 Test scenario 3 and 4: Difficult cases for reconstruction of gene structures 

Tool            Ned      Phs dynactin  Hs dynactin  Pug      Mm dynactin  Hs      Th 
                kinesin  p62           p50          coronin  p150         myosin  CAPα 
Scipio 1.5 (a)  ✓        ✓             ✓            ✓        ✓            ✓       ✓ 
Prosplign (b)   o        o             ✓            -        ✓                    ✓ 
Exonerate (c)   o        o             -            -        ✓            - 
Prot_map (d)    o        o             -            -        -            o       - 
Fgenesh+ (e)    o        o             -            -        -            o       o 
Wise2 (f)       o        o             -            -        - 
Augustus (g)    o        o             -            -        - 
Fgenesh (e)     o        o             -            -        - 
Genscan (h)     o        o             -            -        - 
BLAT (i)        o        o             -            -        - 
Scipio 1.0 (a)  o        o             -            -        - 

Ned: Neurospora discreta, Phs: Phytophthora sojae, Hs: Homo sapiens, Pug: Puccinia graminis, Mm: Mus musculus, Th: Thielavia heterothallica 
✓ All exons are reconstructed correctly 
o All annotated exons are matched by or overlap with predicted exons 
- Exons are missing 

(a) Ned, Phs: cross species default options; Hs, Mm: default options, exhaust_align_size = 15000; Pug, Th: default options (for detailed parameters see Figures 3, 4, and 5 and Additional file 5) 
(b) Parameters: -full -two_stages 
(c) Parameters: -model protein2genome 
(d) Similarity: Weak; Search for one best alignment only (for more parameters see Additional file 5) 
(e) Organisms: Ned, Th: Neurospora crassa; Phs: Phytophtora; Hs: Human; Pug: Puccina; Mm: Mouse 
(f) Parameters: -both 
(g) Parameters: -genemodel = exactlyone; Organisms: Ned: -species = neurospora; Phs, Pug, Th: -species = generic; Hs, Mm: -species = human (for more parameters see Additional file 5) 
(h) Organism: Vertebrate; Suboptimal exon cutoff: 1.00 
(i) Parameters: -minScore = 15; Ned, Phs: -tileSize = 5 -minIdentity = 54 -oneOff = 1; Hs, Pug, Mm, Th: -tileSize = 7 -minIdentity = 81 




versions of Scipio and WebScipio, with an improved 
reconstruction of very short exons and intron splice 
sites, especially for the case of cross-species searches. 
To this end, we introduced a version of the Needleman- 
Wunsch algorithm that was shown to find a higher 
number of short exons previously missed, and to correct 
intron boundaries, especially in cases of lower sequence 
similarity. Furthermore, gaps in the mapping are now 
more frequently explained by divergent sequences, 
allowing for longer regions of insertions or deletions 
predicted on the same exon. Several parameters were 
introduced that can be used to fine-tune this behaviour 
if necessary. The sequence similarity between query and 
target sequence decreases with increasing evolutionary 
distance. While Blat is in principle able to locate hits 
for more distant species, the results become more and 
more incomplete, raising the importance of the post- 
processing. We could show that Scipio is now able to 
almost completely reconstruct genes from species 
whose ancestors separated more than 100 Myr ago. 
WebScipio allows easy access to Scipio and genome 
assemblies of about 640 eukaryotic species. This is 
unique to all gene reconstruction/prediction tools avail- 
able and allows easy identification and reconstruction of 



protein homologs in related organisms. We compared 
the performance of Scipio to many other tools using 
our test data. While there are only minor differences in 
the reconstruction of the mammalian dynein heavy 
chain genes between Scipio, Exonerate, Prosplign, and 
Fgenesh+, the other software tools were not able to correctly 
reconstruct the more difficult cases encoding very 
short exons and showing strong sequence divergence at 
intron borders or inside of exons. Scipio is also the 
only tool available that is able to correctly 
reconstruct and predict genes that are spread on several 
contigs. 

Availability and requirements 

Project name: WebScipio, Scipio 

Project home page: http://www.webscipio.org 

Operating system: Platform independent 

Programming languages: Ruby, Perl 

Software requirements: Installation of Blat and BioPerl 
for using Scipio as command-line tool. WebScipio has 
been tested with Internet Explorer, Firefox, Chrome, 
Safari, and Opera. 

License: WebScipio and Scipio may be obtained upon 
request and used under a GNU General Public License. 
Any restrictions to use by non-academics: Using 
WebScipio and Scipio by non-academics requires 
permission. 

Additional material 



Additional file 1: Activity flow of the hit processing step. The 
scheme shows a detailed activity flow of the hit processing step. Here, 
the experienced user can see, where and how the various expert 
parameters modulate Scipio's hit processing, and can thus adjust these 
parameters to get the best result possible. 

Additional file 2: Protein - DNA alignments corresponding to the 
example searches. Here, additional data corresponding to the example 
searches is provided. 

Additional file 3: Table with detailed data of the results of the 
cross-species search of the human DHC genes in the elephant 
genome. The table provides detailed data to the cross-species searches 
including numbers of matches and mismatches, gaps and intron?'s, for 
the searches with different parameters. 

Additional file 4: Detailed evaluation values used for Tables 2, 3, and 
4. This file provides a description of each evaluation parameter and the 
values obtained with each software tool for all sequence predictions. The 
values highlighted in yellow were used for Tables 2, 3, and 4. 

Additional file 5: Software versions and run parameters of the gene 
reconstruction and prediction tools. The table shows the exact 
versions and run parameters, which were used for the comparison, for 
each scenario. 



List of abbreviations 

Blat: BLAST like alignment tool; FTP: File transfer protocol; HTML: Hypertext 
markup language; SVG: Scalable vector graphics; PNG: Portable network 
graphics; YAML: YAML ain't markup language. 